Central auditory test performance predicts future neurocognitive function in children living with and without HIV

Tests of the brain’s ability to process complex sounds (central auditory tests) correlate with overall measures of neurocognitive performance. In low- and middle-income countries, where resources to conduct detailed cognitive testing are limited, tests that assess the central auditory system may provide a novel and useful way to track neurocognitive performance. This could be particularly useful for children living with HIV (CLWH). To evaluate this, we administered central auditory tests to CLWH and children living without HIV and examined whether central auditory tests given early in a child’s life could predict later neurocognitive performance. We used a machine learning technique to incorporate factors known to affect performance on neurocognitive tests, such as education. The results show that central auditory tests are useful predictors of neurocognitive performance and perform as well as, or in some cases better than, factors such as education. Central auditory tests may offer an objective way to track neurocognitive performance in CLWH.

processing, CATs go beyond the peripheral auditory system and require rapid cortical processing, concentration, attention, and integration of executive function within the brain. Measures of central auditory processing correlate with neurocognitive function in people living with HIV 13. Niemczak et al. 13 showed that central auditory tests (CATs) are positively associated with both learning and working memory measures. Furthermore, recent research shows that electrophysiological measures of sound processing [e.g., the frequency following response (FFR)] can be used as markers of central nervous system dysfunction in CLWH (Ealer et al., Accepted at AIDS). These results suggest CATs might be useful for tracking or predicting neurocognitive function.
Because the interaction of multiple factors can affect neurocognitive function, a machine learning approach to prediction may be useful. Machine learning allows multiple factors to be considered to improve overall predictive abilities from a diverse data set. To understand whether CATs, in addition to other factors such as education, can predict neurocognitive function, machine learning models such as random forest (RF), eXtreme Gradient Boosting (XGBoost), and support vector machines (SVM) may provide an improved analytical method [14][15][16]. Machine learning models consistently yield better predictive performance than traditional statistical models with biomedical data or other data with high levels of complexity [17][18][19]. Highly predictive machine learning models can detect predictor-outcome dependencies that traditional statistical models may fail to detect. Furthermore, machine learning models are better suited to learn nonlinear trends and patterns in the longitudinal predictor-outcome relationship when hyperparameters (i.e., model settings) are appropriately tuned during model training. These models have not previously been used to assess the predictive ability of central auditory function for future cognitive development in children.
In this study, we examined whether CATs early in childhood could predict cognitive performance at a later time point. We used longitudinal CAT scores and state-of-the-art machine learning models to examine the predictive ability of CATs for neurocognitive deficits in children living with and without HIV. Specifically, we investigated the ability of CAT performance to predict performance on Leiter-3 neurocognitive test composites (Nonverbal IQ, Processing Speed, and Nonverbal Attention/Memory) administered 0.5-1.5 years later. We used longitudinal data collected from children in Dar es Salaam, Tanzania. We hypothesized that predictors involving CATs would yield useful predictions of subsequent Leiter-3 composite performance.

Predicting neurocognitive function
We used data from children between the ages of 3-10 years who were part of a longitudinal study of HIV and neurocognitive performance in Dar es Salaam, Tanzania. Children came at 6-month to 1-year intervals to have measurements of both central auditory function using CATs and cognitive performance using the Leiter-3. CATs could be either behavioral (the child heard a sound and gave a response) or electrophysiological (electrical activity of populations of neurons in the auditory pathway evoked by an acoustic stimulus). Children with fewer than three visits and/or missing CATs and Leiter-3 scores were excluded from the analysis. We used 7 different sets of variables to predict neurocognitive function as measured by the Leiter-3 (see "Methods" section for details). The performance metrics used were area under the receiver operating characteristic curve (AUROC/AUC) and F1-score. Rather than describing model performance at a single sensitivity and specificity threshold, the AUC provides a single summary measure of a model's overall predictive ability across all possible thresholds. Accuracy, which is simply the proportion of correct predictions, is often used when the distribution of outcome classes is balanced. In our case a class imbalance exists; therefore, it is more statistically appropriate to use F1-score as a model performance metric. The F1-score accounts for the fact that the minority class (the class with the smaller sample) may be harder to detect than the majority class. Consequently, it provides a more balanced measure of a model's performance across all classes than accuracy does in the presence of class imbalance. Both metrics are presented on a 0-1 scale, with higher values (closer to 1) being better. Four machine learning models were used: random forest, eXtreme Gradient Boosting, logistic regression, and support vector machines.
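The contrast between accuracy, F1-score, and AUC under class imbalance can be illustrated with a minimal sketch. The labels and scores below are hypothetical and are not study data. The sketch also assumes that F1 is computed for the minority (below age-based expectation) class, which the text does not state explicitly, and it uses plain Python rather than whatever library the analysis actually used.

```python
def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auc(y_true, scores, positive=1):
    """AUC as the probability that a positive case outranks a negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    # Count pairwise "wins" for the positive class; ties count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test set: 4 of 20 children are below age-based expectations (label 1).
y_true = [1] * 4 + [0] * 16

# A degenerate model that always predicts the majority class.
always_majority = [0] * 20
print("accuracy:", accuracy(y_true, always_majority))  # high despite missing every minority case
print("F1:", f1_score(y_true, always_majority))        # zero, because no minority case is found

# Hypothetical predicted probabilities of class 1 from some other model.
scores = [0.9, 0.8, 0.4, 0.3] + [0.5] * 2 + [0.2] * 14
print("AUC:", auc(y_true, scores))
```

In this toy example the majority-class model looks acceptable by accuracy but is exposed by F1, which is the motivation given above for preferring F1 when classes are imbalanced.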
We used the arithmetic mean (m) and standard deviation (sd) of F1 and AUC scores across the four machine learning models (see "Methods" section) to determine which sets of predictors performed best and were most stable. For Nonverbal IQ, the models that included electrophysiological CATs and covariates as predictors yielded the highest and most stable predictive performances across the four models (m F1 = 0.72 ± 0.04, m AUC = 0.64 ± 0.06). For Processing Speed, the models that included the behavioral CATs yielded the highest and most stable results across the models (m F1 = 0.72 ± 0.07, m AUC = 0.65 ± 0.07).
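As a rough illustration of this summary step, the sketch below computes the mean and standard deviation of per-model scores for two predictor sets. The scores and set names are invented for illustration and are not values from the results tables. The use of the sample standard deviation (n − 1 denominator) is also an assumption, since the text does not say which form was used.

```python
from statistics import mean, stdev

# Hypothetical F1 scores for two predictor sets across the four model types.
f1_by_predictor_set = {
    "predictor_set_A": {"RF": 0.70, "XGBoost": 0.74, "LR": 0.72, "SVM": 0.72},
    "predictor_set_B": {"RF": 0.60, "XGBoost": 0.80, "LR": 0.70, "SVM": 0.50},
}


def summarize(per_model_scores):
    """Return (mean, sample sd) of one predictor set's scores across models."""
    values = list(per_model_scores.values())
    return mean(values), stdev(values)


summary = {name: summarize(s) for name, s in f1_by_predictor_set.items()}

# Prefer the highest mean; break ties in favor of the lower (more stable) sd.
best = max(summary, key=lambda name: (summary[name][0], -summary[name][1]))

for name, (m, sd) in summary.items():
    print(f"{name}: m F1 = {m:.2f} ± {sd:.2f}")
print("best:", best)
```

A higher mean indicates better average performance, while a smaller standard deviation indicates that the result depends less on which model was used.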

Discussion
As expected, factors such as education and HIV were useful for predicting performance on neurocognitive tests. The important finding, however, was that CATs alone were often the best predictor of subsequent poor neurocognitive function. This is interesting because CATs and the Leiter-3 differ substantially in how they engage the brain. The Leiter-3 is a totally non-verbal test. Instructions are given via gesture and facial expressions and no information is provided using speech. CATs, by contrast, test the underlying processes involved in processing
The results also suggest CATs may be useful as an early predictor. CATs taken early in this longitudinal study predicted cognitive performance later on. While these results are preliminary, they present the possibility that CATs could be used early in a child's development and perhaps tracked over time to obtain information on how HIV may be affecting the brain. By combining CAT results with environmental and social factors (including HIV and education) at young ages and using the predictive ability of readily available machine learning models, we can better predict which children with HIV may have lower neurocognitive performance at later ages. This is of high importance given the aforementioned findings that CLWH have a higher risk of neurocognitive deficits. Implementing standardized testing at any age is challenging, particularly in LMICs. There are always cultural elements to consider, including the novelty of testing and the influence of normative differences between how tests are developed in the West and the cultural context in which they are being used. CATs can be completed with less influence from culture while providing a strong measure of brain function; hence, they are crucial for improving predictions of neurocognitive performance in LMICs. Having tools that can predict functional problems in the future, and at such a young age, can lead to more rapid identification of developmental concerns and allow for earlier intervention to promote better outcomes 20. This would be a significant improvement over trying to perform detailed neurocognitive testing.
A limitation of our approach is that we observed instability in the performance of the machine learning models. For instance, performance varied from model to model for certain predictors (from AUC = 0.73 with XGBoost to AUC = 0.53 with SVM for behavioral CATs only as predictors with Nonverbal IQ as the outcome). This is likely a consequence of limited sample size and/or low representation of the below age-based expectation class (the neurocognitive underperformers) in the testing set. With low representation of the below age-based expectation class in the testing set, a single misclassification or correct classification is enough to substantially decrease or increase the predictive performance of a model. We also observed that CATs did not perform well in predicting Nonverbal Memory composite performance. It is challenging to discern why that is the case. It may be due to the nature of the specific tasks in the Nonverbal Memory composite, which rely on higher-order processes (e.g., working memory, cognitive flexibility) that are less apparent in the Nonverbal IQ or Processing Speed composites. In this sense, CATs may align best with general cognitive reasoning and speed of information processing. Another challenging aspect of the analytical phase of our work is the low number of predictors in the data. Machine learning models reach their full predictive potential in the presence of a large number of predictors, where they can find high-order interplay between those predictors. With a low number of predictors in the data, we also run the risk of creating correlated trees in the random forest models, which would cause overfitting of the models on the training set and subsequent underperformance on the testing set. Further investigation on larger datasets is needed to confirm our findings and obtain more consistent and stable predictive performance from our models. Additionally, we may need to follow children for a longer period of time to establish early CATs as true and meaningful predictors of later neurocognitive function.
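The sensitivity of the metrics to a single classification can be made concrete with a short sketch. It uses the count of 8 below-expectation children in the Nonverbal IQ testing set (from the Table 2 caption). The number of false positives is a hypothetical value chosen only for illustration.

```python
def recall(tp, n_positive):
    """Fraction of the minority (below-expectation) class that is detected."""
    return tp / n_positive


def f1(tp, fp, fn):
    """F1 for the minority class from confusion-matrix counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


N_MINORITY = 8       # below-expectation children in the testing set (Table 2 caption)
FALSE_POSITIVES = 3  # hypothetical count, held fixed for the comparison

for tp in (5, 6):
    fn = N_MINORITY - tp
    print(f"detected {tp}/8: recall = {recall(tp, N_MINORITY):.3f}, "
          f"F1 = {f1(tp, FALSE_POSITIVES, fn):.3f}")

# Each additional child classified correctly moves recall by 1/8.
step = recall(6, N_MINORITY) - recall(5, N_MINORITY)
print("recall change per child:", step)
```

With only 8 minority-class children, one child changes recall by 0.125 and shifts F1 by a comparable amount, which is consistent with the model-to-model instability described above.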
In summary, we built machine learning models that used CATs and demographic covariates to predict neurocognitive function in CLWH and healthy controls with high accuracy. In multiple instances, models that contained CATs as predictors outperformed models that contained covariates only as predictors. We conclude that both behavioral and electrophysiological CATs have promise as predictors of neurocognitive function.

Participants and data
The data for CLWH and HIV-negative children were collected as part of an ongoing longitudinal study in Dar es Salaam, Tanzania. Our research protocol was approved by the Committee for Protection of Human Subjects at Dartmouth College and the Research Ethics Committee of Muhimbili University of Health and Allied Sciences, and all methodological steps were conducted in accordance with the relevant guidelines and regulations. Participants were recruited from local pediatric programs, district hospitals, and schools. Informed consent was obtained for all minors in this study from a parent and/or a legal guardian. As part of the study, participants attended biannual follow-ups up to age 6 and annual follow-ups after that. At each visit, peripheral auditory function, CATs (behavioral and electrophysiological), and the Leiter-3 were collected.

Inclusion criteria
Data for 109 children were selected from a dataset using the following criteria. At the time of enrollment children were 3-11 years old, had normal hearing in both ears (i.e., ≤ 25 dB HL at 0.5, 1.0, 2.0, and 4.0 kHz), no history of exposure to traumatic noise, and normal tympanometry. Children with a history of mental illness, neurological disease, or loss of consciousness were excluded, as these factors impact performance on CATs [21][22][23]. HIV status was confirmed in children using medical records or a rapid HIV test and reconfirmed using an ELISA assay. Children with 3 or more visits were included in the analysis. Children with missing scores for more than one CAT, more than two electrophysiological CATs, or Leiter-3 were excluded from the analysis, as missing data can impact the predictive abilities of machine learning models, especially in small datasets 24.

Audiometry
Pure-tone thresholds were collected in all children for 0.5, 1.0, 2.0, 4.0, 6.0, and 8.0 kHz using Békésy-like tracking and modified Hughson-Westlake procedures (see Niemczak et al. 25 for details). Thresholds of 25 dB HL or higher for each ear were considered abnormal. Pure tone average (PTA) was calculated by averaging thresholds from 0.5 to 4.0 kHz.
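The PTA calculation can be sketched as follows. The sketch assumes that "from 0.5 to 4.0 kHz" means the four tested frequencies in that range (0.5, 1.0, 2.0, and 4.0 kHz). The threshold values are hypothetical and do not come from any participant.

```python
# Frequencies (kHz) assumed to enter the pure tone average.
PTA_FREQS_KHZ = (0.5, 1.0, 2.0, 4.0)


def pure_tone_average(thresholds_db_hl):
    """Average the thresholds (dB HL) at the PTA frequencies for one ear."""
    return sum(thresholds_db_hl[f] for f in PTA_FREQS_KHZ) / len(PTA_FREQS_KHZ)


# Hypothetical thresholds for one ear, keyed by frequency in kHz.
right_ear = {0.5: 10, 1.0: 15, 2.0: 20, 4.0: 15, 6.0: 20, 8.0: 25}

pta = pure_tone_average(right_ear)
print(f"PTA = {pta:.1f} dB HL")
```

The 6.0 and 8.0 kHz thresholds are measured but do not enter the average under this assumption.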

Behavioral central auditory tests
The Hearing in Noise Test (HINT), Triple Digit Test (TDT), and Staggered Spondaic Words Test (SSW) were used to measure central auditory processing in children (for details see Niemczak et al. 13) 11. HINT and TDT are used to assess one's ability to perceive and process speech in noise, while SSW measures dichotic processing 26. All tests were administered and presented in the Kiswahili language.

Electrophysiological tests
Recording of the auditory brainstem response (ABR) followed a methodology similar to that of Niemczak et al. 25. An Intelligent Hearing Systems SmartEP (Miami, FL) was used to record ABR measurements using 100 μs rarefaction clicks presented at a rate of 21.1/s (slow) or 61.1/s (fast) at 80 dB sound pressure level to the right ear. The electrode montage consisted of the right earlobe as the inverting electrode, ground at Fpz, and the high forehead at Fz serving as the non-inverting electrode. Two repetitions of each click were recorded and averaged (total 2000 sweeps). Responses were filtered from 0.1 to 1.5 kHz. The absolute latencies and amplitudes of waves I, III, and V were measured from baseline.
The frequency following response (FFR) was evoked using the /ba/ syllable and collected from all subjects using the same hardware as the ABR. The collection of the FFR has been described in detail elsewhere 27, and the methodology follows Ealer et al. (in review). Stimuli were played monaurally to the right ear at 80 dB HL at a rate of 4.35 per second. Two runs of 3000 artifact-free responses were collected, and responses were then filtered offline from 0.7 to 2 kHz. The /ba/ was primarily analyzed at the vowel region of the stimulus (i.e., the /a/), which is spectrotemporally static. The /ba/ stimulus was 180 ms in duration, and the vowel region was from 60 to 180 ms. The FFR was recorded with alternating polarity. When analyzing the FFR, we added these polarities together (added condition) or subtracted them from one another (subtracted condition) to emphasize lower frequency information (i.e., fundamental frequency) or higher frequency information (i.e., formants), respectively.
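The effect of combining the two stimulus polarities can be shown with synthetic signals. This is a minimal sketch of the general principle, not the study's processing pipeline. It assumes a response made of an envelope-following component that does not invert with stimulus polarity and a fine-structure component that does. The sampling rate, frequencies, and amplitudes are hypothetical.

```python
import math

FS_HZ = 10_000       # hypothetical sampling rate
N_SAMPLES = 1_200    # hypothetical 120 ms analysis window at FS_HZ
F0_HZ = 100.0        # hypothetical envelope (fundamental) frequency
FORMANT_HZ = 700.0   # hypothetical fine-structure (formant-range) frequency

t = [i / FS_HZ for i in range(N_SAMPLES)]

# Envelope-following component: same sign for both stimulus polarities.
envelope = [math.sin(2 * math.pi * F0_HZ * x) for x in t]
# Fine-structure component: flips sign when the stimulus polarity flips.
fine_structure = [0.5 * math.sin(2 * math.pi * FORMANT_HZ * x) for x in t]

# Simulated responses to the two stimulus polarities.
resp_polarity_a = [e + f for e, f in zip(envelope, fine_structure)]
resp_polarity_b = [e - f for e, f in zip(envelope, fine_structure)]

# Added condition: fine structure cancels, leaving the envelope (lower frequencies).
added = [a + b for a, b in zip(resp_polarity_a, resp_polarity_b)]
# Subtracted condition: envelope cancels, leaving fine structure (higher frequencies).
subtracted = [a - b for a, b in zip(resp_polarity_a, resp_polarity_b)]

err_added = max(abs(s - 2 * e) for s, e in zip(added, envelope))
err_subtracted = max(abs(s - 2 * f) for s, f in zip(subtracted, fine_structure))
print("max deviation, added vs 2 x envelope:", err_added)
print("max deviation, subtracted vs 2 x fine structure:", err_subtracted)
```

Some analyses divide the sum and difference by two to preserve amplitude scale; the sketch omits that step because the text describes simple addition and subtraction.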

Leiter-3
The Leiter International Performance Scale-Third Edition (Leiter-3) assesses neurocognitive functioning in children and adults from 3 to 75 years of age. Domains measured include fluid and categorical reasoning, visual identification, and mental sequencing. The test is entirely nonverbal, with instructions delivered via gestures and pantomime. Participants provide responses via pointing, block or manipulative placement, and paper-and-pencil task completion. We have demonstrated the feasibility and acceptability of the Leiter-3 in Tanzania 6,28 (see Lichtenstein et al. 6 for a discussion of training methods and procedures).

Predictors
Predictors were extracted from early visits of all participants (see Table 1). We defined early visits as those at or before the median age of a child during their participation in the study. Gender, HIV status, the maximum number of years of education (i.e., years of education hereafter), and the best score for each behavioral and electrophysiological CAT during early visits were extracted. We defined gender, HIV, and years of education as covariates. We reasoned that maximum years of education and best CAT scores should be used to predict best performance on the Leiter-3 composites. Age was excluded as a variable because it is highly correlated with years of education (r = 0.83). Including age would introduce correlation bias to tree-based models 29. Using our predictors, we constructed 7 models with a variety of variable combinations: (1) covariates alone, (2) behavioral CATs, (3) electrophysiological CATs, (4) behavioral and electrophysiological CATs, (5) covariates and behavioral CATs, (6) covariates and electrophysiological CATs, and (7) covariates with behavioral and electrophysiological CATs.
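The seven predictor combinations are exactly the non-empty subsets of three variable groups, and early visits are selected with a median-age cutoff. The sketch below illustrates both steps under stated assumptions. The column names and visit records are hypothetical, and the median is assumed to be taken over each child's own ages at their visits, which is one reading of the definition given in the text.

```python
from itertools import combinations
from statistics import median

# Hypothetical column names for each group of predictors.
PREDICTOR_GROUPS = {
    "covariates": ["gender", "hiv_status", "years_of_education"],
    "behavioral_cats": ["hint", "tdt", "ssw"],
    "electrophysiological_cats": ["abr", "ffr"],
}


def predictor_sets(groups):
    """Build every non-empty combination of the predictor groups."""
    names = list(groups)
    sets = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            sets[" + ".join(combo)] = [col for g in combo for col in groups[g]]
    return sets


def early_visits(visits):
    """Keep visits at or before the child's median age across visits."""
    cutoff = median(v["age"] for v in visits)
    return [v for v in visits if v["age"] <= cutoff]


all_sets = predictor_sets(PREDICTOR_GROUPS)
for label, columns in all_sets.items():
    print(label, "->", columns)

# Hypothetical visit records for one child.
child_visits = [
    {"age": 4.0, "years_of_education": 0},
    {"age": 4.5, "years_of_education": 0},
    {"age": 5.0, "years_of_education": 1},
    {"age": 6.0, "years_of_education": 2},
    {"age": 7.0, "years_of_education": 3},
]
early = early_visits(child_visits)
max_education = max(v["years_of_education"] for v in early)
print("early visits:", len(early), "max years of education:", max_education)
```

The sketch enumerates the same seven sets as the numbered list above, although in a different order. Selecting the "best" CAT score per child would additionally require knowing, for each test, whether higher or lower values indicate better performance, so that step is left out.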

Table 1. Summary of the distribution of demographic and central auditory test data for HIV+ and HIV− children. The first p value assesses the distribution of gender across HIV groups, while the remaining p values compare the mean of each variable between HIV groups.

Table 2. Results for predicting Nonverbal IQ. There were 24 children in the below age-based expectations category for the training set, and 8 children in the testing set. The top number in each box is the F1-score and the bottom number is the AUC.

Table 4. Results for predicting Nonverbal Attention/Memory. There were 17 children in the below age-based expectations category for the training set, and 6 children in the testing set.
Dartmouth College and the Research Ethics Committee of Muhimbili University of Health and Allied Sciences and all methodologies steps were conducted in accordance with the relevant guidelines and regulations.